Genetic algorithm, Bayes classifier

Abstract

A method is described for finding decision boundaries, approximated
by piecewise linear segments, for classifying patterns in $\Re^N$, $N \geq 2$,
using simulated annealing. It involves generation and placement of a set of
hyperplanes (represented by strings) in the feature space that yields
minimum misclassification. Theoretical analysis shows that as the size of the
training data set approaches infinity, the boundary provided by the
simulated annealing based classifier will approach the Bayes boundary. The
effectiveness of the classification methodology, along with the
generalization ability of the decision boundary, is demonstrated for both artificial
data and real life data sets having non-linear/overlapping class boundaries.
Results are compared extensively with those of the Bayes classifier, k-NN
rule and multilayer perceptron, and Genetic Algorithms, another popular
evolutionary technique. Empirical verification of the theoretical claim is
also provided.


1  Introduction 

Simulated Annealing (SA) [1, 2, 3, 4] belongs to a class of local search algorithms.
It utilizes the principles of statistical mechanics, regarding the behaviour of a
large number of atoms at low temperature, for finding minimal cost solutions to
large optimization problems by minimizing the associated energy. Let $E(q, T)$
be the energy at temperature $T$ when the system is in the state $q$. Let a new
state $s$ be generated. Then state $s$ is accepted in favour of state $q$ with a
probability

$$p_{qs} = \frac{1}{1 + e^{-(E(q,T) - E(s,T))/T}}.$$

In statistical mechanics, investigating the ground states or low energy states of
matter is of fundamental importance. These states are achieved at very low
temperature. However, it is not sufficient to lower the temperature alone since
this results in unstable states. In the annealing process, the temperature is first
raised, then decreased gradually to a very low value ($T_{min}$), while ensuring that
one spends sufficient time at each temperature value. This process yields stable
low energy states.

Pattern  classification  can  be  viewed  as  a  problem  of  search  and  placement  of 
a  number,  H,  of  hyperplanes  (fixed  a  priori)  which  can  model  the  decision 
boundary  of  the  given  data  set  appropriately.  The  criterion  to  be  minimized 
is  the  number  of  samples  of  the  given  training  data  that  are  misclassified  for  a 
particular  arrangement  of  the  H  hyperplanes.  The  arrangement  of  hyperplanes 
that  minimizes  the  number  of  misclassified  data  points  is  considered  to  provide 
the  decision  boundary  of  the  given  training  data  set. 

The present article describes a methodology demonstrating the searching ability
of SA for finding an appropriate arrangement of H hyperplanes that minimises
the number of misclassified points. The effectiveness of the classifier has been
established for several artificial and real life data sets with both
overlapping and non-overlapping class boundaries. The results are also
compared with a similar approach [5] based on genetic algorithms (GA) [4, 6],
the Bayes maximum likelihood classifier, the k-NN rule [7] and the multilayered
perceptron (MLP) [8].

Besides, a theoretical analysis along with an empirical verification is presented
which shows that for the size of the training data set going to infinity, the
SA based classifier (or SA classifier) will provide an error probability of the
training data which is less than or equal to the Bayes error probability. (In this
regard it may be mentioned here that the Bayes maximum likelihood classifier
[7] is one of the most widely used statistical pattern classifiers which provides
optimal performance from the standpoint of error probabilities in a statistical
framework. It is known to be the best classifier when the class distributions
and the a priori probabilities are known. Consequently, the desirable property
of any classifier is that it should approximate or approach the Bayes classifier
under limiting conditions.)

A brief discussion on the principles of simulated annealing is first presented in
the next section. This is followed by a detailed description of the SA classifier.
The theoretical analysis is provided in Section 4 followed by the implementation
results in Section 5. Finally, the discussion and conclusions are presented in the
last section.

2  Simulated  Annealing  :  Basic  Principles 


In the recent past, application of techniques having physical or natural
correspondence for solving difficult optimization problems has received widespread
attention. It has been found that these techniques consistently outperform
classical methods like gradient descent search when the search space is large,
complex and multimodal. Simulated annealing (SA) is one such paradigm
having its foundation in statistical mechanics, which studies the behaviour of a
very large system of interacting components in thermal equilibrium.


In statistical mechanics, if the system is in thermal equilibrium, the
probability $\pi_T(s)$ that the system is in state $s$, $s \in S$, $S$ being the state space, at
temperature $T$, is given by

$$\pi_T(s) = \frac{e^{-E(s)/kT}}{\sum_{w \in S} e^{-E(w)/kT}} \qquad (1)$$

where $k$ is Boltzmann's constant and $E(s)$ is the energy of the system in
state $s$.


Metropolis et al. [9] developed a technique to simulate the behaviour of the
system in thermal equilibrium at temperature $T$ as follows: Let the system be
in state $q$ at time $t$. Then the probability $p$ that it will be in state $s$ at time
$t + 1$ is given by the equation

$$p = \frac{\pi_T(s)}{\pi_T(q)} = e^{-(E(s) - E(q))/kT}. \qquad (2)$$

If the energy of the system in state $s$ is less than that in state $q$, then $p > 1$ and
the state $s$ is automatically accepted. Otherwise it is accepted with probability
$p$. Thus it is also possible to attain states with higher energy values. It can be
shown that for $t \to \infty$, the probability that the system is in state $s$ is given by
$\pi_T(s)$ irrespective of the starting configuration [10].

Begin
    generate the initial string randomly = q
    T = Tmax
    Let E(q,T) be the associated energy
    while (T > Tmin)
        for i = 1 to k
            Mutate (flip) a random position in q to yield s
            Let E(s,T) be the associated energy
            Set q <- s with probability p_qs = 1 / (1 + e^(-(E(q,T) - E(s,T))/T))
        end for
        T = rT
    end while
    Decode the string q to provide the solution of the problem.
End

Figure  1:  Steps  of  Simulated  Annealing 
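The steps of Fig. 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `acceptance_probability` and `simulated_annealing`, the `seed` argument and the use of a plain list of 0/1 integers as the string are all hypothetical choices. The acceptance rule is the Boltzmann-machine form of $p_{qs}$ described in this section, evaluated in a numerically stable way so that very low temperatures do not overflow.

```python
import math
import random


def acceptance_probability(e_q, e_s, t):
    """p_qs = 1 / (1 + exp(-(E(q) - E(s)) / T)), computed without overflow."""
    x = (e_q - e_s) / t
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so exp(x) cannot overflow
    return z / (1.0 + z)


def simulated_annealing(energy, length, t_max=100.0, t_min=0.01,
                        r=0.9, k=100, seed=None):
    """Anneal a bit string of the given length; `energy` maps a list of bits to a cost."""
    rng = random.Random(seed)
    q = [rng.randint(0, 1) for _ in range(length)]  # random initial string
    e_q = energy(q)
    t = t_max
    while t > t_min:
        for _ in range(k):  # k trials at each temperature
            s = list(q)
            s[rng.randrange(length)] ^= 1  # flip one randomly chosen bit
            e_s = energy(s)
            if rng.random() < acceptance_probability(e_q, e_s, t):
                q, e_q = s, e_s
        t *= r  # geometric cooling, 0 < r < 1
    return q, e_q


# Toy energy: the number of 1s in the string (minimum 0 at the all-zero string).
final_string, final_energy = simulated_annealing(sum, length=16, seed=0)
print(final_string, final_energy)
```

The toy energy is used only to exercise the loop; for the classifier, `energy` would be the misclassification count of the decoded hyperplanes.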

When dealing with a system of particles, it is important to investigate very low
energy states, which predominate at extremely low temperatures. To achieve
such states, it is not sufficient to lower the temperature. An annealing schedule
is used, where the temperature is first increased and then decreased gradually,
spending enough time at each temperature in order to reach thermal
equilibrium.

In this article we have used the annealing process of the Boltzmann machine,
which is a variant of the Metropolis algorithm. Here, at a given temperature
$T$, the new state is chosen with a probability

$$p_{qs} = \frac{1}{1 + e^{-(E(q,T) - E(s,T))/T}}.$$


The parameters of the search space are encoded in the form of a bit string of a
fixed length. The objective value associated with the string is computed and
mapped to its energy. The string with the minimum energy value provides the
solution to the problem. The initial string (say $q$) of 0s and 1s is generated
randomly and its energy value is computed. Keeping the initial temperature
high (say $T = T_{max}$), a neighbour of the string (say $s$) is generated by randomly
flipping one bit. The energy of the new string is computed and it is accepted in
favour of $q$ with the probability $p_{qs}$ mentioned earlier. This process is repeated a
number of times (say $k$) keeping the temperature constant. Then the
temperature is decreased using the equation $T = rT$, where $0 < r < 1$, and the $k$ loops,
as earlier, are executed. This process is continued till a minimum temperature
(say $T_{min}$) is attained. The simulated annealing steps are shown in Fig. 1.


3  Description  of  the  SA  classifier 


The correspondence between the physical aspect of simulated annealing and
an optimization problem is as follows: the parameters of the search space
(in this case the H hyperplanes) are encoded in strings (usually binary) and
these represent the different states; low energy states correspond to near
optimal solutions (or an arrangement of the hyperplanes that provides minimum
misclassification); the energy corresponds to the objective function (or the number
of misclassified samples), and temperature is a controlling parameter of the
system. The important tasks here are to establish a way of representing and
generating different configurations (or states) of the problem and an annealing
schedule. These are now discussed in detail.

3.1 State/Hyperplane Representation

In this article, a binary string of length $l$ is used to encode the parameters of the
H hyperplanes. From elementary geometry, the equation of a hyperplane in
$N$ dimensional space ($X_1 - X_2 - \cdots - X_N$) is given by

$$x_N \cos\alpha_{N-1} + \beta_{N-1} \sin\alpha_{N-1} = d \qquad (3)$$

where

$$\begin{aligned}
\beta_{N-1} &= x_{N-1} \cos\alpha_{N-2} + \beta_{N-2} \sin\alpha_{N-2} \\
\beta_{N-2} &= x_{N-2} \cos\alpha_{N-3} + \beta_{N-3} \sin\alpha_{N-3} \\
&\;\;\vdots \\
\beta_1 &= x_1 \cos\alpha_0 + \beta_0 \sin\alpha_0
\end{aligned}$$

The various parameters are as follows :

$X_i$ : the $i$th feature of the training points.

$(x_1, x_2, \ldots, x_N)$ : a point on the hyperplane.

$\alpha_{N-1}$ : the angle that the unit normal to the hyperplane makes with the $X_N$
axis.

$\alpha_{N-2}$ : the angle that the projection of the normal in the $(X_1 - X_2 - \cdots - X_{N-1})$
space makes with the $X_{N-1}$ axis.

$\vdots$

$\alpha_1$ : the angle that the projection of the normal in the $(X_1 - X_2)$ plane makes
with the $X_2$ axis.

$\alpha_0$ : the angle that the projection of the normal in the $(X_1)$ plane makes with
the $X_1$ axis $= 0$. Hence, $\beta_0 \sin\alpha_0 = 0$.

$d$ : the perpendicular distance of the hyperplane from the origin.

Thus the $N$ tuple $\langle \alpha_1, \alpha_2, \ldots, \alpha_{N-1}, d \rangle$ specifies a hyperplane in $N$
dimensional space.
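The recursion behind Eq. (3) can be made concrete with a short Python sketch. The function name `hyperplane_lhs` and the convention of passing the angles as the list `[alpha_1, ..., alpha_{N-1}]` are assumptions made for illustration; the only fact taken from the text is that $\alpha_0 = 0$, so that $\beta_1 = x_1$.

```python
import math


def hyperplane_lhs(x, alphas):
    """Left-hand side of Eq. (3) for a point x = (x_1, ..., x_N).

    `alphas` holds alpha_1 ... alpha_{N-1}; alpha_0 is fixed at 0, so the
    recursion starts from beta_1 = x_1.
    """
    beta = x[0]  # beta_1 = x_1 * cos(0) + beta_0 * sin(0) = x_1
    for j in range(1, len(x)):
        # x[j] is x_{j+1} and alphas[j - 1] is alpha_j
        beta = x[j] * math.cos(alphas[j - 1]) + beta * math.sin(alphas[j - 1])
    return beta  # equals x_N cos(alpha_{N-1}) + beta_{N-1} sin(alpha_{N-1})


# In two dimensions with alpha_1 = 0 the normal lies along X_2,
# so the left-hand side is simply x_2.
print(hyperplane_lhs((3.0, 5.0), [0.0]))
```

A point lies on the hyperplane when this value equals $d$, and the sign of the difference tells on which side of the hyperplane the point falls.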

Each angle $\alpha_j$, $j = 1, 2, \ldots, N-1$, is allowed to vary in the range of 0 to $2\pi$. If
$b_1$ bits are used to represent an angle, then the possible values of $\alpha_j$ are

$$0,\ \delta \cdot 2\pi,\ 2\delta \cdot 2\pi,\ 3\delta \cdot 2\pi,\ \ldots,\ (2^{b_1} - 1)\delta \cdot 2\pi$$

where $\delta = 1/2^{b_1}$. Consequently, if the $b_1$ bits contain a binary string having the
decimal value $v_1$, then the angle is given by $v_1 \cdot \delta \cdot 2\pi$.

Once the angles are fixed, the orientation of the hyperplane becomes fixed. Now
only $d$ must be specified in order to specify the hyperplane. For this purpose the
hyper rectangle enclosing the training points is considered. Let $(x_i^{min}, x_i^{max})$ be
the minimum and maximum values of feature $X_i$ as obtained from the training
points. Then the vertices of the enclosing hyper rectangle are given by

$$(x_1^{ch_1}, x_2^{ch_2}, \ldots, x_N^{ch_N})$$

where each $ch_i$, $i = 1, 2, \ldots, N$, can be either max or min. (Note that there will
be $2^N$ vertices.) Let $diag$ be the length of the diagonal of this hyper rectangle
given by

$$diag = \sqrt{(x_1^{max} - x_1^{min})^2 + (x_2^{max} - x_2^{min})^2 + \cdots + (x_N^{max} - x_N^{min})^2}.$$

A hyperplane is designated as the base hyperplane with respect to a given
orientation (i.e., for some $\alpha_1, \alpha_2, \ldots, \alpha_{N-1}$) if

i : it has the same orientation

ii : it passes through one of the vertices of the enclosing rectangle

iii : its perpendicular distance from the origin is minimum (among the
hyperplanes passing through the other vertices). Let this distance be $d_{min}$.

If $b_2$ bits are used to represent $d$, then a value of $v_2$ in these bits represents a
hyperplane with the given orientation and for which $d$ is given by $d_{min} + \frac{diag}{2^{b_2}} \cdot v_2$.

Thus a string is of a fixed length of $l = H((N - 1) \cdot b_1 + b_2)$, where H = the
number of hyperplanes. The initial string is generated randomly.
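The decoding of one hyperplane from its $(N-1) b_1 + b_2$ bits can be sketched as below. This is an illustrative sketch, not the authors' code: the names `bits_to_int`, `hyperplane_lhs` and `decode_hyperplane` are hypothetical, the bit order (most significant bit first, angles before distance) is an assumption, and so is the reading of "minimum distance" as the smallest signed value of the left-hand side of Eq. (3) over the $2^N$ vertices of the enclosing hyper rectangle.

```python
import math
from itertools import product


def bits_to_int(bits):
    """Decimal value of a list of 0/1 bits, most significant bit first (assumed order)."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def hyperplane_lhs(x, alphas):
    """Left-hand side of Eq. (3); alphas = [alpha_1, ..., alpha_{N-1}], alpha_0 = 0."""
    beta = x[0]
    for j in range(1, len(x)):
        beta = x[j] * math.cos(alphas[j - 1]) + beta * math.sin(alphas[j - 1])
    return beta


def decode_hyperplane(bits, n_dims, b1, b2, mins, maxs):
    """Decode (N-1)*b1 + b2 bits into the tuple (alphas, d)."""
    alphas = []
    for j in range(n_dims - 1):
        v1 = bits_to_int(bits[j * b1:(j + 1) * b1])
        alphas.append(v1 * 2 * math.pi / 2 ** b1)  # v1 * delta * 2*pi, delta = 1 / 2^b1
    start = (n_dims - 1) * b1
    v2 = bits_to_int(bits[start:start + b2])
    diag = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs)))
    # Base hyperplane: same orientation, through the vertex giving the
    # smallest signed distance (assumed interpretation of d_min).
    vertices = product(*zip(mins, maxs))
    d_min = min(hyperplane_lhs(v, alphas) for v in vertices)
    return alphas, d_min + diag / 2 ** b2 * v2


# Two features in [0, 3] x [0, 4], 2 bits for the angle and 2 bits for d.
alphas, d = decode_hyperplane([0, 1, 1, 0], n_dims=2, b1=2, b2=2,
                              mins=(0.0, 0.0), maxs=(3.0, 4.0))
print(alphas, d)
```

A full string for H hyperplanes would be split into H consecutive chunks of this length and each chunk decoded in the same way.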

Note that we have used this recursive form of representation over the classical
one viz. $l_1 x_1 + l_2 x_2 + \cdots + l_N x_N = d$, where $l_1, \ldots, l_N$ are known as the
direction cosines. The latter representation involves a constraint equation,
$l_1^2 + l_2^2 + \cdots + l_N^2 = 1$. This, in turn, leads to the complicated issue of getting invalid or
unacceptable solutions when the constraint equation is violated. However, the
representation that we have chosen avoids this problem by being unconstrained
in nature.

3.2 Energy/Objective Value Computation

A string encodes the parameters of H hyperplanes as described earlier. Using
these parameters, the region in which each training pattern point lies is
determined from equation (3). A region is said to provide the demarcation for class
$i$, if the maximum number of points that lie in this region belong to class $i$. Other
points that lie in this region are considered to be misclassified. The
misclassifications associated with all the regions (for these H hyperplanes) are summed
up to provide the total misclassification, miss, for the string, which represents
its energy.
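A minimal Python sketch of this energy computation is given below. The names `hyperplane_lhs` and `misclassification` are hypothetical, and identifying a region by the tuple of sides (left-hand side of Eq. (3) at least $d$, or not) on which a point falls is an assumption about how "the region in which each point lies" is determined.

```python
import math
from collections import Counter, defaultdict


def hyperplane_lhs(x, alphas):
    """Left-hand side of Eq. (3); alphas = [alpha_1, ..., alpha_{N-1}], alpha_0 = 0."""
    beta = x[0]
    for j in range(1, len(x)):
        beta = x[j] * math.cos(alphas[j - 1]) + beta * math.sin(alphas[j - 1])
    return beta


def misclassification(points, labels, hyperplanes):
    """Total number of misclassified points (`miss`) for a set of (alphas, d) hyperplanes."""
    regions = defaultdict(Counter)
    for x, label in zip(points, labels):
        # A region is identified by the side of every hyperplane the point is on.
        signature = tuple(hyperplane_lhs(x, alphas) >= d for alphas, d in hyperplanes)
        regions[signature][label] += 1
    # In each region the majority class is taken as correct; the rest are errors.
    return sum(sum(c.values()) - max(c.values()) for c in regions.values())


points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
vertical_line = [([math.pi / 2], 1.0)]  # the line x_1 = 1
print(misclassification(points, [0, 0, 1, 1], vertical_line))
print(misclassification(points, [0, 1, 1, 1], vertical_line))
```

In the first call the line separates the two classes; in the second, one point on the left side belongs to the minority class of its region and is counted as misclassified.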


3.3  New  State  Generation  Process  and  Annealing 
Schedule 

For  generating  a  new  configuration,  one  (or  more)  random  position(s)  in  the 
bit  string  is  chosen  and  flipped.  This  provides  a  new  string,  whose  energy  is 
computed  in  the  above  mentioned  manner. 

As already mentioned, the crucial task here is the attainment of low energy
states, obtained at very low temperatures. If the temperature is decreased
quickly, then the low energy states tend to be unstable. In order to reach
stable states, the temperature must be initially increased, and then decreased
gradually allowing sufficient time at each temperature. This process is known
as annealing. In order to simulate this method, initially the temperature is
kept high ($= T_{max}$). A parameter $k$ is used to control the time spent at each
temperature value. The temperature is decreased according to the formula
$T = rT$, where $0 < r < 1$. A higher value of $r$ indicates a more gradual annealing
schedule. The different steps of the SA classifier are shown in Fig. 2. The
process continues until either a string with no misclassified points is obtained
(miss = 0) or a user specified minimum temperature value ($= T_{min}$) is attained.
The final string $q$ at termination provides the solution to the problem.

4 Relationship with Bayes Error Probability


In this section we study the theoretical relationship between the SA classifier
and the Bayes classifier in terms of the error probabilities. The mathematical
notations and preliminary definitions are described first. This is followed by the
claim that for $n \to \infty$ the performance of the SA classifier will in no way be
worse than that of the Bayes classifier. Finally some critical comments about the
proof are mentioned.

Let there be $k$ classes $C_1, C_2, \ldots, C_k$ with a priori probabilities $P_1, P_2, \ldots, P_k$
and class conditional densities $p_1(x), p_2(x), \ldots, p_k(x)$. Let the mixture density
be

$$p(x) = \sum_{i=1}^{k} P_i\, p_i(x). \qquad (4)$$

Let $X_1, X_2, \ldots, X_n, \ldots$ be independent and identically distributed (i.i.d.) $N$
dimensional random vectors with density $p(x)$. This indicates that there is a
probability space $(\Omega, \mathcal{F}, Q)$, where $\mathcal{F}$ is a $\sigma$ field of subsets of $\Omega$, $Q$ is a
probability measure on $\mathcal{F}$, and

$$X_i : (\Omega, \mathcal{F}, Q) \to (\Re^N, B(\Re^N), P), \quad \forall i = 1, 2, \ldots$$

such that

$$P(A) = Q(X_i^{-1}(A)) = \int_A p(x)\,dx$$

$\forall A \in B(\Re^N)$ and $\forall i = 1, 2, \ldots$.

Here $B(\Re^N)$ is the Borel $\sigma$ field of $\Re^N$.

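Let

$$\mathcal{E} = \Big\{E : E = (S_1, S_2, \ldots, S_k),\ S_i \subseteq \Re^N,\ S_i \neq \emptyset,\ \forall i = 1, \ldots, k,\ \bigcup_{i=1}^{k} S_i = \Re^N,\ S_i \cap S_j = \emptyset,\ \forall i \neq j\Big\}.$$

$\mathcal{E}$ provides the set of all partitions of $\Re^N$ into $k$ sets as well as their permutations,
i.e., if

$$E_1 = (S_1, S_2, S_3, \ldots, S_k) \in \mathcal{E}$$

$$E_2 = (S_2, S_1, S_3, \ldots, S_k) \in \mathcal{E}$$

then $E_1 \neq E_2$. Note that $E_i = (S_{i1}, S_{i2}, \ldots, S_{ik})$ implies that each $S_{ij}$, $1 \leq j \leq k$,
is the region corresponding to class $C_j$.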

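Let $E_0 = (S_{01}, S_{02}, \ldots, S_{0k}) \in \mathcal{E}$ be such that each $S_{0i}$ is the region corresponding
to the class $C_i$ in $\Re^N$ and these are obtained by using the Bayes decision rule. Then

$$a = \sum_{i=1}^{k} P_i \int_{S_{0i}^c} p_i(x)\,dx \ \leq\ \sum_{i=1}^{k} P_i \int_{S_{1i}^c} p_i(x)\,dx \qquad (5)$$

$\forall E_1 = (S_{11}, S_{12}, \ldots, S_{1k}) \in \mathcal{E}$, where $S^c$ denotes the complement of $S$ in $\Re^N$. Here $a$ is the
error probability obtained using the Bayes decision rule.

It is known from the literature that such an $E_0$ exists and it belongs to $\mathcal{E}$
because the Bayes decision rule provides an optimal partition of $\Re^N$, and for every
such $E_1 = (S_{11}, S_{12}, \ldots, S_{1k}) \in \mathcal{E}$, $\sum_{i=1}^{k} P_i \int_{S_{1i}^c} p_i(x)\,dx$ provides the error probability
for $E_1 \in \mathcal{E}$. Note that $E_0$ need not be unique.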


Assumptions : Let $H_0$ be a positive integer and let there exist $H_0$ hyperplanes
in $\Re^N$ which can provide the regions $S_{01}, S_{02}, \ldots, S_{0k}$. Let $H_0$ be known a priori.
Let the algorithm for generation of class boundaries using $H_0$ hyperplanes be
allowed to be executed for a sufficiently large number of iterations in each
temperature value and for sufficiently low temperatures. Let the number of
strings be $t$ with misclassification values $miss_1, miss_2, \ldots, miss_t$ where
$0 \leq miss_1 < miss_2 \leq \ldots \leq miss_t$. Let $p_{ij}^{(n_1)}(T)$ denote the probability of going
from string $i$ to string $j$ in $n_1$ steps with the temperature value $T$. It is known
in the literature that for the adopted SA algorithm

$$\lim_{n_1 \to \infty} p_{ij}^{(n_1)}(T) = p_j(T), \quad \text{where} \quad p_j(T) = \frac{e^{-miss_j/T}}{\sum_{r=1}^{t} e^{-miss_r/T}}.$$

It follows that

$$\lim_{T \to 0+} p_j(T) = \begin{cases} 1 & \text{for } j = 1 \\ 0 & \text{for } j \neq 1. \end{cases}$$

Thus it is known that using the SA technique of
i) making $n_1 \to \infty$ and ii) making $T \to 0+$, one can get the optimal string and
its value.
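The limiting behaviour of $p_j(T)$ as $T \to 0+$ can be checked numerically with the short sketch below. The function name `boltzmann_distribution` and the misclassification values are made up for illustration; subtracting the smallest value before exponentiating is only a numerical safeguard and leaves the ratio unchanged.

```python
import math


def boltzmann_distribution(miss, temperature):
    """p_j(T) = exp(-miss_j / T) / sum_r exp(-miss_r / T) for each string j."""
    m0 = min(miss)  # shifting by the minimum avoids underflow of every term
    weights = [math.exp(-(m - m0) / temperature) for m in miss]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical misclassification values with a unique minimum miss_1 = 2.
miss = [2, 3, 5, 5]
for temperature in (100.0, 1.0, 0.1, 0.01):
    probabilities = boltzmann_distribution(miss, temperature)
    print(temperature, [round(p, 4) for p in probabilities])
```

At high temperature the strings are nearly equiprobable, while at low temperature almost all of the probability mass sits on the string with the smallest misclassification value.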


Let $\mathcal{A} = \{A : A$ is a set consisting of $H_0$ hyperplanes in $\Re^N\}$. Let $A_0 \in \mathcal{A}$
be such that it provides the regions $S_{01}, S_{02}, \ldots, S_{0k}$ in $\Re^N$, i.e., $A_0$ provides
the regions which are also obtained using the Bayes decision rule. Note that
each $A \in \mathcal{A}$ generates several elements of $\mathcal{E}$. Let $\mathcal{E}_A \subseteq \mathcal{E}$ denote all possible
$E = (S_1, S_2, \ldots, S_k) \in \mathcal{E}$ that can be generated from $A$.

Let $G = \bigcup_{A \in \mathcal{A}} \mathcal{E}_A$.

Let $Z_{iE}(\omega) = 1$ if $X_i(\omega)$ is misclassified when $E$ is used as a decision
rule, where $E \in G$, $\forall \omega \in \Omega$,
$= 0$ otherwise.

Let $f_{nE}(\omega) = \frac{1}{n} \sum_{i=1}^{n} Z_{iE}(\omega)$, when $E \in G$ is used as a decision rule.

Let $f_n(\omega) = \mathrm{Inf}\{f_{nE}(\omega) : E \in G\}$.

It is to be noted that the pattern classification algorithm described in Section
3 uses $n \times f_{nE}(\omega)$, the total number of misclassified samples, as the objective
function which it attempts to minimize. This is equivalent to searching for a
suitable $E \in G$ such that the term $f_{nE}(\omega)$ is minimized, i.e., for which $f_{nE}(\omega) =
f_n(\omega)$. As already mentioned, it is known that for infinitely many iterations and
$T \to 0+$ the SA algorithm will certainly be able to obtain such an $E$.

Theorem : For sufficiently large $n$, $f_n(\omega) \ngtr a$ (i.e., for sufficiently large $n$,
$f_n(\omega)$ cannot be greater than $a$) almost everywhere.

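Proof : Let $Y_i(\omega) = 1$ if $X_i(\omega)$ is misclassified according to the Bayes rule, $\forall \omega \in \Omega$,
$= 0$ otherwise.

Note that $Y_1, Y_2, \ldots, Y_n, \ldots$ are i.i.d. random variables. Now

$$\begin{aligned}
\mathrm{Prob}(Y_i = 1) &= \sum_{j=1}^{k} \mathrm{Prob}(Y_i = 1 \mid X_i \text{ is in } C_j)\, P(X_i \text{ is in } C_j) \\
&= \sum_{j=1}^{k} P_j\, \mathrm{Prob}(\omega : X_i(\omega) \in S_{0j}^c \text{ given that } \omega \in C_j) \\
&= \sum_{j=1}^{k} P_j \int_{S_{0j}^c} p_j(x)\,dx = a.
\end{aligned}$$

Hence the expectation of $Y_i$, $E(Y_i)$, is given by

$$E(Y_i) = a, \quad \forall i.$$

Then by using the Strong Law of Large Numbers [11], $\frac{1}{n} \sum_{i=1}^{n} Y_i \to a$ almost
everywhere,

i.e., $Q(\omega : \frac{1}{n} \sum_{i=1}^{n} Y_i(\omega) \not\to a) = 0$.

Let $B = \{\omega : \frac{1}{n} \sum_{i=1}^{n} Y_i(\omega) \to a\} \subseteq \Omega$. Then $Q(B) = 1$.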

!=1 

Note that $f_n(\omega) \leq \frac{1}{n} \sum_{i=1}^{n} Y_i(\omega)$, $\forall n$ and $\forall \omega$, since the set of regions
$(S_{01}, S_{02}, \ldots, S_{0k})$ obtained by the Bayes decision rule is also provided by some
$A \in \mathcal{A}$ and consequently it will be included in $G$. Note that $0 \leq f_n(\omega) \leq 1$,
$\forall n$ and $\forall \omega$. Let $\omega \in B$. For every $\omega \in B$, $U(\omega) = \{f_n(\omega); n = 1, 2, \ldots\}$ is a
bounded, infinite set. Then by the Bolzano-Weierstrass theorem [12], there exists
an accumulation point of $U(\omega)$. Let $y = \mathrm{Sup}\{y_0 : y_0$ is an accumulation point
of $U(\omega)\}$. From elementary mathematical analysis we can conclude that $y \leq a$,
since $\frac{1}{n} \sum_{i=1}^{n} Y_i(\omega) \to a$ almost everywhere and $f_n(\omega) \leq \frac{1}{n} \sum_{i=1}^{n} Y_i(\omega)$. Thus it is
proved that for sufficiently large $n$, $f_n(\omega)$ cannot be greater than $a$ for $\omega \in B$.

♦ 

It is to be mentioned that the theorem proved above indicates that as the
size of the training data set is increased, the performance of the SA classifier
will approach that of the Bayes classifier. The fact that $f_n(\omega) < a$ is true for
only a finite number of sample points, since many distributions can generate
these points. However, as the size of the data set goes to infinity, only one
distribution can possibly generate all the points [13]. Also, since we know that
the Bayes classifier is the optimal one in a statistical framework, and there can be no
better classifier, the above mentioned claim (that $f_n(\omega) \leq a$) can only indicate
that $f_n(\omega) = a$; in other words, the performance of the SA classifier will
tend to that of the Bayes classifier in the limiting case. This, in turn, indicates
that under limiting conditions, the boundary provided by the SA classifier
will approach the Bayes boundary. This is experimentally demonstrated in
Section 5.4.

Note : The term 'sufficiently large' is borrowed from statistics books and
indicates the mathematical term '$n \to \infty$'.

5 Implementation and Results


The  three  data  sets  used  for  demonstrating  the  effectiveness  of  the  SA  classifier 
are  the  following  : 

ADS  1  :  This  two  dimensional  artificial  data  set  (Fig.  3)  consists  of  557 
data  points  belonging  to  two  classes.  It  is  evident  that  the  classes,  which  are 
separable,  have  non  linear  class  boundary. 

Vowel Data : This real life speech data consists of 871 Indian Telugu vowel
sounds in six classes represented by $\{\delta, a, i, e, o, u\}$ [14]. It has three features
corresponding to the first, second and third formant frequencies. Fig. 4 shows
the data set in the first and second formant frequency plane.

Iris  data  :  This  four  dimensional  data  set  for  a  specific  category  of  irises  has 
150  points  in  three  classes  [15] .  The  features  correspond  to  the  sepal  width  and 
length  and  petal  width  and  length  in  centimeters. 

Data Set 1 : This two dimensional data set, used for verifying the theoretical
result in Section 4, is generated using a triangular distribution for the two
classes, 1 and 2. The range for class 1 is [0,2] x [0,2] and that for class 2 is
[1,3] x [0,2] with the corresponding peaks at (1,1) and (2,1). If $P_1$ is the a priori
probability of class 1, then using elementary mathematics, we can show that
the Bayes classifier will classify a point to class 1 if its X coordinate is less than
$1 + P_1$. This indicates that the Bayes decision boundary is given by

$$x = 1 + P_1. \qquad (6)$$
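The boundary of Eq. (6) can be checked numerically with a small sketch. It assumes that the class densities along X are the symmetric triangular densities on [0,2] and [1,3] with peaks at 1 and 2, and that the dependence on the second coordinate is identical for both classes and therefore cancels in the comparison; the function names `triangular_pdf` and `bayes_class` are hypothetical.

```python
def triangular_pdf(x, lo, peak, hi):
    """Density of a triangular distribution on [lo, hi] with mode `peak`."""
    if lo <= x <= peak:
        return 2.0 * (x - lo) / ((hi - lo) * (peak - lo))
    if peak < x <= hi:
        return 2.0 * (hi - x) / ((hi - lo) * (hi - peak))
    return 0.0


def bayes_class(x, p1):
    """Bayes decision for X coordinate x when class 1 has a priori probability p1."""
    g1 = p1 * triangular_pdf(x, 0.0, 1.0, 2.0)          # class 1 on [0, 2]
    g2 = (1.0 - p1) * triangular_pdf(x, 1.0, 2.0, 3.0)  # class 2 on [1, 3]
    return 1 if g1 > g2 else 2


# The decision should switch from class 1 to class 2 at x = 1 + P1.
for p1 in (0.2, 0.4, 0.5, 0.7):
    boundary = 1.0 + p1
    print(p1, bayes_class(boundary - 0.01, p1), bayes_class(boundary + 0.01, p1))
```

In the overlap region $1 \leq x \leq 2$ the comparison reduces to $P_1 (2 - x) > (1 - P_1)(x - 1)$, which rearranges to $x < 1 + P_1$.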


5.1  Performance  of  SA  classifier 

The  parameters  of  SA  are  as  follows  : 

Tmax = 100
Tmin = 0.01
r = 0.9
k = 100

Table 1: Performance during training of SA classifier for different values of H
using 10% training data

Data Set   Recognition Score
           H = 3    H = 4    H = 5    H = 6    H = 8
ADS 1      94.54, 98.18, 100.0, 100.0 (four of the five entries survive; column positions uncertain)
Vowel      52.94, 74.71, 95.29, 96.65 (four of the five entries survive; column positions uncertain)
Iris       100.0    100.0    100.0    100.0    100.0

Table 2: Performance during testing of SA classifier for different values of H
for 10% training and 90% test data

Data Set   Recognition Score
           H = 3    H = 4    H = 5    H = 6    H = 8
ADS 1      93.02, 93.02, 88.64 (three of the five entries survive; column positions uncertain)
Vowel      63.35    65.60    76.84    74.55    70.73
Iris       89.63    93.33    93.33    93.33    77.78

Accordingly, the maximum number of iterations will be 8800 (88 temperature
values with k = 100 iterations at each). In order to generate a new string, one
randomly chosen bit is flipped.
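The figure of 8800 iterations follows directly from the schedule parameters listed above, as the following sketch shows. The function name `count_sa_iterations` is hypothetical; the loop simply counts how many temperature values the schedule $T = rT$ visits between Tmax and Tmin.

```python
def count_sa_iterations(t_max=100.0, t_min=0.01, r=0.9, k=100):
    """Return (number of temperature values, total iterations) for the schedule T = r*T."""
    temperature_steps = 0
    t = t_max
    while t > t_min:
        temperature_steps += 1
        t *= r
    return temperature_steps, temperature_steps * k


steps, iterations = count_sa_iterations()
print(steps, iterations)  # 88 8800
```

Equivalently, $100 \times 0.9^n > 0.01$ holds for $n = 0, 1, \ldots, 87$, which gives 88 temperature values.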

The  results  shown  are  the  average  values  of  five  different  runs  of  the  algorithm. 

Table 1 shows the overall training performance of the SA classifier for data sets
ADS 1, Vowel and Iris using five values of H when 10% of the data set is used
for training. As expected, the training score generally improves to a maximum
of 100% as the number of hyperplanes is increased, since more hyperplanes can
readily fit the training data set to reduce the number of misclassified points.
Note that because of the considerable amount of overlap, for the Vowel data,
consideration of even H = 8 could not provide zero misclassification.

Tables 2 and 3 show the test results of the SA classifier for these three data
sets, for five values of H, when 10% and 30% of the data set are used for
training while the remaining 90% and 70% of the data are used for testing, respectively.
Unlike the training performance, the test recognition score improves initially as
H is increased up to a specific value, beyond which the score decreases. For
example, consider H = 6 and H = 8 of Table 2 for ADS 1, where the score
decreases during testing, although it remained constant (at 100%) during training
(Table 1). This indicates that H = 8 leads to overfitting of the classes during
training, thereby reducing the generalization capability of the classifier during
testing. Similar is the case for H = 6 and 8 for ADS 1 in Table 3. As expected,
the overall recognition capability of the classifier increases when the size of the
training data set is increased from 10% in Table 2 to 30% in Table 3.

Table 3: Performance during testing of SA classifier for different values of H
for 30% training and 70% test data

Data Set   Recognition Score
           H = 3    H = 4    H = 5    H = 6    H = 8
ADS 1      91.28    96.92    98.72    96.41    96.20
Vowel      65.60    67.48    75.98    75.00    79.90
Iris       93.33    95.23    94.28    91.42    94.28


5.2 Replacing Simulated Annealing with Genetic Algorithm

Genetic Algorithm (GA) [6] is another evolutionary search paradigm, based on
the principles of natural genetic systems and survival of the fittest. Like SA,
GAs also generally work with a binary string encoding of the parameters of
the search problem. Instead of dealing with a single string or chromosome, a GA
operates on a number of strings termed a population. A fitness value, which is
maximized, is associated with each string and represents the degree of
goodness associated with it. Several biologically inspired operators like selection,
crossover and mutation are applied iteratively over a number of generations
to generate potentially better solutions. Termination is achieved if either a
maximum number of iterations has been executed or a user specified criterion
is satisfied. Details of the method can be found in [4, 6].

The fitness computation method is the same as the process of calculating the
energy associated with a string (see Section 3.2). Roulette wheel selection
strategy, single point crossover strategy with probability 0.8 and bit wise mutation
with a variable mutation probability value in the range [0.015, 0.333] [5] for a
population size of 20 are chosen for the GA. The maximum number of iterations
is fixed at 1500. The comparative performance (in terms of both the test score
and number of iterations required for attaining zero misclassification during
training) of SA and GA for the classification problem is presented in Table 4,
when 10% of the data is considered for training and the remaining 90% for testing.
An entry '-' indicates that zero misclassification could not be achieved even
after the maximum number of iterations was executed.

Table 4: Comparative Performance of SA and GA for classification for H = 6

Data Set   GA                 SA
           iter.    score     iter.    score
ADS 1      512      93.22     5815     93.02
Vowel      -        71.99     -        74.55
Iris       97, 93.33 (two of the four entries survive; column positions uncertain)

As is evident from Table 4, the test recognition scores of the GA and SA
based classifiers are comparable. Although GA requires fewer iterations than SA
to attain zero misclassification, it performs many more string evaluations,
since one iteration of GA evaluates up to 20 strings (the size of the
population), whereas exactly one new string is evaluated in each iteration of
SA. On this count, GA requires at most 10240 and 440 string evaluations for
ADS 1 and Iris respectively, which is significantly more than that required by
SA. However, one must note that among the 10240 (or 440) strings evaluated by
GA for ADS 1 (or Iris) there will be many replications. In fact, only a
relatively small fraction of the strings will be unique.
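
The counting argument can be written out explicitly. The symbols G (number of GA iterations), P (population size) and I (number of SA iterations) are notation introduced here for illustration. The SA count is approximate because the initial string is also evaluated once.

```latex
\[
  E_{\mathrm{GA}} \le G \times P, \qquad E_{\mathrm{SA}} \approx I
\]
\[
  \text{ADS 1:}\quad E_{\mathrm{GA}} \le 512 \times 20 = 10240,
  \qquad E_{\mathrm{SA}} \approx 5815,
  \qquad \frac{10240}{5815} \approx 1.76
\]
\[
  \text{Iris:}\quad E_{\mathrm{GA}} \le 440 = 22 \times 20
\]
```

These are upper bounds for GA. Since replicated strings need not be evaluated again, the number of distinct evaluations may be considerably smaller.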


5.3  Comparison  with  other  classifiers 

The performance of the SA classifier is compared with that of the Bayes maximum
likelihood classifier, the multilayer perceptron (MLP) and the k-NN rule. Both
MLP (with hard limiters) and the k-NN rule are known to provide piecewise linear
boundaries, which is the underlying philosophy of the SA classifier.

Table 5: Comparative Test Performance with 10% Training Data

Data Set    SA classifier    Bayes max.      MLP      k-NN
            for H = 6        like. class.             k = $\sqrt{n}$
ADS 1       93.02            85.65           82.47    90.23
Vowel       74.55            77.73           60.30    70.35
Iris        93.33            83.22           74.81    90.37


The k-NN algorithm is executed taking k equal to $\sqrt{n}$, where n is the
number of training data points. It can be proved that for such a form of k, the
error probability of the k-NN rule approaches the Bayes error probability as n
tends to infinity. For the Bayes maximum likelihood classifier, unequal
dispersion matrices and unequal a priori probabilities (= $n_i/n$ for $n_i$
patterns from class i) are considered. In each case, we assume a multivariate
normal distribution of the samples.
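
A minimal sketch of the k-NN rule follows, assuming that k is the square root of n rounded to the nearest integer, that distances are Euclidean, and that ties are resolved in favour of the class encountered first among the nearest neighbours. The function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=None):
    """Classify `query` by majority vote among its k nearest training points."""
    n = len(train_points)
    if k is None:
        k = max(1, round(math.sqrt(n)))  # assumed reading: k = sqrt(n)
    # Indices of the training points, sorted by Euclidean distance to the query.
    order = sorted(range(n), key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11), (12, 12)]
    labels = ["a"] * 4 + ["b"] * 5
    print(knn_predict(points, labels, (0.5, 0.5)))
    print(knn_predict(points, labels, (10.5, 10.5)))
```

Choosing k to grow with n while k/n shrinks to zero is what lets the rule's error approach the Bayes error in the limit. The square root of n is one convenient choice that satisfies both conditions.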

For MLP, the learning rate and momentum factor are 0.9 and 0.1 respectively.
Online updating of the connection weights, i.e., updating after the presentation
of each training data point, is performed. A maximum of 10000 iterations is
allowed. The network architectures for the ADS 1, Vowel and Iris data sets are
2-5-2, 3-8-6 and 4-5-3 respectively, where the first and the last numbers
represent the number of nodes in the input and output layers, and the
intermediate number(s) represent the number of nodes in the hidden layer(s).
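
The exact weight update rule is not spelled out in this section. A common form of online gradient descent with momentum, assumed here purely for illustration, adds a fraction of the previous weight change to the current gradient step. The function name `momentum_step` and the flat list representation of the weights are hypothetical.

```python
def momentum_step(weights, grads, velocity, lr=0.9, momentum=0.1):
    """One online update: new change = momentum * old change - lr * gradient."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + dv for w, dv in zip(weights, new_velocity)]
    return new_weights, new_velocity


if __name__ == "__main__":
    w, v = [1.0], [0.0]
    # First pattern: gradient 1.0, so the weight moves by -0.9.
    w, v = momentum_step(w, [1.0], v)
    # Second pattern: zero gradient, so only the momentum term (-0.09) acts.
    w, v = momentum_step(w, [0.0], v)
    print(w, v)
```

In an online scheme this step would be applied once per training pattern, with the gradient computed by backpropagation for that pattern alone.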

The results in Table 5 show that the SA classifier performs better than all
the other classifiers for both ADS 1 (where k-NN is known to
perform well) and Iris. For the Vowel data, the result of the Bayes classifier
is the best. In fact, the Bayes classifier is known to perform well for this data
[14]. In this case also, the recognition score of the SA classifier is
closer to the Bayes score than those of MLP and k-NN.


5.4  Empirical  Verification  of  the  Theoretical  Result 

As a consequence of the theorem in Section 4, the boundary provided by the SA
classifier approaches the Bayes boundary under limiting conditions. Fig. 5(a)-(c)
demonstrates that this is indeed the case for Data Set 1. The Bayes
boundary is the straight line x = 1.4. The SA line is marked with an arrow.
Fig. 5(a), (b) and (c) show the SA lines obtained for n = 100, 1000 and 4000
respectively. Only 100 data points are plotted in the figures for clarity. The
figures show that as n increases from 100 to 4000, the SA line
approaches the Bayes line, so much so that for n = 4000 they lie very close to
each other.
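
The limiting behaviour can be illustrated with a small one-dimensional simulation. Everything specific in it is an assumption: class 1 is taken as triangular on [0, 2] with peak at 1, class 2 as triangular on [1, 3] with peak at 2, and the prior of class 1 as 0.4, values chosen only because they give a Bayes boundary at x = 1.4. An exhaustive search over thresholds is used in place of SA, to keep the sketch short and deterministic.

```python
import random


def sample(n, p1=0.4, seed=0):
    """Draw n labelled points from two overlapping triangular classes (assumed)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        if rng.random() < p1:
            data.append((rng.triangular(0.0, 2.0, 1.0), 1))
        else:
            data.append((rng.triangular(1.0, 3.0, 2.0), 2))
    return data


def best_threshold(data):
    """Find t minimising training errors for the rule: class 1 if x < t, else class 2."""
    data = sorted(data)
    # Threshold below every point: all points are assigned to class 2.
    errors = sum(1 for _, label in data if label == 1)
    best_err, best_t = errors, data[0][0]
    for i, (x, label) in enumerate(data):
        # Moving the threshold just above x assigns x to class 1.
        errors += -1 if label == 1 else 1
        if errors < best_err:
            nxt = data[i + 1][0] if i + 1 < len(data) else x
            best_err, best_t = errors, (x + nxt) / 2
    return best_t, best_err


if __name__ == "__main__":
    for n in (100, 1000, 4000):
        t, err = best_threshold(sample(n))
        print(n, round(t, 3), err)
```

Under these assumptions the two weighted densities are equal where 0.4(2 - x) = 0.6(x - 1), i.e. at x = 1.4, so the threshold found from the training data can be compared directly with the Bayes boundary as n grows.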


6  Discussion  and  Conclusions 


A pattern classification methodology in $\Re^N$ that uses simulated annealing to
search for and place a number of hyperplanes, in order to approximate the
class boundaries of a given training data set, has been described. An extensive
comparison of the methodology with other classifiers, namely the Bayes
classifier (which is well known for discriminating overlapping classes), and the
k-NN classifier and MLP (which are well known for discriminating
non-overlapping, non-linear regions by generating piecewise linear boundaries),
is also presented. The results of the proposed algorithm are seen to be
comparable to, and sometimes better than, those of these classifiers in
discriminating both overlapping and non-overlapping, non-convex regions.

A distinguishing feature of this approach is that the boundaries (approximated
by piecewise linear segments) need to be generated explicitly for making
decisions. This is unlike the conventional methods or the multilayered perceptron
(MLP) based approaches, where the generation of boundaries is a consequence
of the respective decision making processes.

A theoretical analysis of the aforesaid classifier establishes that under limiting
conditions of infinitely large training data sets, the error probability of the SA
classifier during training is less than or equal to that of the Bayes classifier.
This, in turn, indicates that when the size of the training data set goes to
infinity, the boundary provided by the SA classifier approaches the Bayes
boundary. This finding is also experimentally verified for a data set, generated
using a triangular distribution, for which the Bayes boundary is known exactly.

A comparison of SA with GA for this classification problem shows that both
perform comparably in terms of the test recognition scores. This is expected,
since both are stochastic optimization techniques, working on the same principle
of approximating the class boundaries using a number of hyperplanes. In terms
of the string evaluations required to obtain the optimal performance, SA appears
to score over GA. However, one must note that it is very difficult to obtain
the actual number of distinct string evaluations in GA, since strings are often
replicated. The actual number of distinct evaluations will, in fact, be a small
fraction of the quantity (population size × number of iterations).

Although SA is found to perform comparably to GA, there appear to be several
factors contributing to the predominance of GAs in the literature. In SA, two
main control parameters must be selected appropriately in order to obtain
good performance. These are the values of r (which controls the sequence
of T) and k (the number of iterations executed at each temperature). In GA,
on the other hand, only the maximum number of iterations (or stopping
time) must be appropriately selected. Other than this, both methods need
proper tuning of several other parameters, e.g., Tmax and Tmin in SA, and the
probabilities of crossover and mutation in GA. Additionally, in the advanced
stages of the SA algorithm, the temperature values should be smaller than the
smallest difference of the energy values in order to provide good performance.
Since, for the pattern classification problem, this value is 1 (the minimum
non-zero difference in the number of misclassified points) and Tmin = 0.01, this
requirement is met. GA with roulette wheel selection is, on the other hand,
immune to this difference. Finally, since SA is inherently sequential in nature,
not much improvement can be derived on parallel computing platforms, while there
is scope for such improvement in GA. One must note that very basic versions of
both SA and GA are used here. Use of enhanced models and improved operators for
both SA and GA may provide better performance. For example, in the case of SA,
other cooling schedules [16, 17] may be used. Similarly, modified versions of GA,
like GACD (genetic algorithm with chromosome differentiation), which has been
found to improve the classification performance [18], may be applied.
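
The role of the temperature parameters can be illustrated with a short sketch. It assumes a geometric schedule, in which T is multiplied by r after each temperature stage, and the usual exponential acceptance rule for worse strings. Neither form is restated in this section. Only Tmin = 0.01 and the unit energy difference come from the text; the Tmax and r values in the usage lines are placeholders.

```python
import math


def acceptance_probability(delta_e, temperature):
    """Probability of accepting a move that changes the energy by delta_e."""
    if delta_e <= 0:  # moves that do not increase the energy are always accepted
        return 1.0
    return math.exp(-delta_e / temperature)


def geometric_schedule(t_max, t_min, r):
    """Temperatures t_max, r * t_max, r^2 * t_max, ... down to t_min (assumed form)."""
    temps = []
    t = t_max
    while t >= t_min:
        temps.append(t)
        t *= r
    return temps


if __name__ == "__main__":
    temps = geometric_schedule(t_max=100.0, t_min=0.01, r=0.5)  # placeholder values
    print(len(temps), temps[-1])
    # One extra misclassified point (delta_e = 1) at T = 0.01 is almost never accepted.
    print(acceptance_probability(1, 0.01))
```

With an energy difference of 1 and T = 0.01, the acceptance probability is exp(-100), so in the final stages the search behaves essentially as a greedy descent, which is the requirement described above.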

FIGURE CAPTIONS


Fig.  1  -  Steps  of  Simulated  Annealing. 

Fig.  2  -  Steps  of  the  SA  classifier. 

Fig.  3  -  Artificial  data  set  ADS  1. 

Fig. 4 - Real life speech data, Vowel data, in the first and second formant
frequency planes.

Fig. 5(a) - Data Set 1 for n = 100 and the boundary provided by the SA
classifier (marked with an arrow) along with the Bayes decision boundary. Class
1 is represented by ‘+’ and class 2 by ‘o’.

Fig. 5(b) - Data Set 1 for n = 1000 and the boundary provided by the SA
classifier (marked with an arrow) along with the Bayes decision boundary. Class
1 is represented by ‘+’ and class 2 by ‘o’.

Fig. 5(c) - Data Set 1 for n = 4000 and the boundary provided by the SA
classifier (marked with an arrow) along with the Bayes decision boundary. Class
1 is represented by ‘+’ and class 2 by ‘o’.